18 Introduction to Statistics
Statistics is a fundamental aspect of business analytics, providing the tools and methods necessary to make sense of data, inform decision-making, and drive strategic actions. In the realm of business, statistics enables organizations to analyze trends, measure performance, and predict future outcomes, thus playing a pivotal role in achieving competitive advantage and operational efficiency.
Definition
Statistics is the science of collecting, organizing, analyzing, interpreting, and presenting data. It involves a wide range of techniques and methodologies used to gather insights from data, allowing for informed decision-making in various fields, including business, economics, healthcare, social sciences, and more.
In business analytics, the importance of statistics cannot be overstated. It helps businesses to:
- Make Data-Driven Decisions: By providing a solid foundation for making decisions based on empirical data rather than intuition or guesswork.
- Understand Market Trends: By analyzing market data to identify patterns, trends, and correlations that can inform strategic planning.
- Measure Performance: By evaluating business performance through key performance indicators (KPIs) and other metrics.
- Predict Future Outcomes: By using predictive models to forecast sales, demand, customer behavior, and other critical business factors.
- Optimize Operations: By analyzing operational data to improve efficiency, reduce costs, and enhance productivity.
18.1 Basic types of Statistics
Statistics is broadly divided into two main categories: descriptive statistics and inferential statistics.
- Descriptive Statistics:
  - Definition: Descriptive statistics involves summarizing and organizing data to describe the main features of a dataset. This type of statistics provides simple summaries and visualizations to make the data more comprehensible.
  - Key Techniques:
    - Measures of Central Tendency: Mean, median, and mode, which describe the center of a data distribution.
    - Measures of Dispersion: Range, variance, and standard deviation, which describe the spread of the data.
    - Data Visualization: Charts, graphs, and tables that provide a visual representation of the data, such as bar charts, histograms, pie charts, and scatter plots.
- Inferential Statistics:
  - Definition: Inferential statistics involves making predictions or inferences about a population based on a sample of data. This type of statistics uses probability theory to estimate population parameters and test hypotheses.
  - Key Techniques:
    - Estimation: Point estimates and confidence intervals that provide an estimate of population parameters.
    - Hypothesis Testing: Techniques such as t-tests, chi-square tests, and ANOVA that assess whether there is enough evidence to support a specific hypothesis.
    - Regression Analysis: Methods for modeling the relationship between a dependent variable and one or more independent variables, including linear regression and logistic regression.
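The two categories above can be illustrated with a minimal Python sketch that uses only the standard library. The `daily_sales` figures are hypothetical, and the helper names `describe` and `mean_confidence_interval` are introduced here purely for illustration. The confidence interval uses a normal approximation, which is an assumption made for simplicity; with a sample this small, a t-distribution would usually be the more appropriate choice.

```python
import math
import statistics
from statistics import NormalDist

# Hypothetical sample: units sold per day over ten days.
daily_sales = [12, 15, 11, 15, 18, 14, 15, 20, 13, 17]


def describe(values):
    """Descriptive statistics: summarize the centre and spread of a sample."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "range": max(values) - min(values),
        "variance": statistics.variance(values),  # sample variance (divides by n - 1)
        "std_dev": statistics.stdev(values),      # sample standard deviation
    }


def mean_confidence_interval(values, confidence=0.95):
    """Inferential statistics: normal-approximation interval for the population mean."""
    n = len(values)
    point_estimate = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    # z is the standard normal quantile that leaves (1 - confidence) / 2 in each tail.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * standard_error
    return point_estimate - margin, point_estimate + margin


summary = describe(daily_sales)
ci_low, ci_high = mean_confidence_interval(daily_sales)

print(summary)
print(f"95% CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")
```

The first function only describes the ten observed days; the second uses the same sample to make a statement about the wider population of days, which is the step from description to inference.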
Statistical Software and Tools
In the modern business environment, statistical analysis is often performed using specialized software and tools that facilitate data manipulation, analysis, and visualization. Some of the widely used statistical software in business analytics include:
- R: A programming language and software environment for statistical computing and graphics.
- Python: A versatile programming language with powerful libraries such as pandas, NumPy, and SciPy for statistical analysis.
- Excel: A spreadsheet software with built-in statistical functions and data analysis tools.
- SPSS: A software package used for statistical analysis in social science and business research.
- Tableau: A data visualization tool that helps create interactive and shareable dashboards.
- Statistics forms the backbone of business analytics, providing essential tools and methodologies for data analysis and interpretation.
- By leveraging statistical techniques, businesses can gain valuable insights, make informed decisions, and drive strategic initiatives.
- As the volume and complexity of data continue to grow, the role of statistics in business analytics will become increasingly important, underscoring the need for a solid understanding of statistical principles and practices.
18.2 Data Types based on structure
18.2.1 Structured Data:
Structured data refers to data that is organized in a predefined format or structure. This type of data is often stored in tabular form with rows and columns, making it easy to enter, store, query, and analyze. Structured data is typically found in databases and spreadsheets.
Characteristics of Structured Data:
- Format: Organized in tables with rows and columns.
- Data Types: Clearly defined data types (e.g., integers, strings, dates).
- Schema: Follows a fixed schema or structure.
- Accessibility: Easy to search and analyze using SQL and other query languages.
Examples:
- Databases (e.g., MySQL, Oracle, SQL Server)
- Spreadsheets (e.g., Microsoft Excel, Google Sheets)
- CSV files
Example of Structured Data:
A table of student information:
StudentID | Name | Age | Grade |
---|---|---|---|
001 | John Doe | 16 | A |
002 | Jane Roe | 17 | B |
003 | Sam Smith | 16 | A |
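As a rough sketch of why a fixed schema makes structured data easy to query, the student table above can be written as CSV and read with Python's standard `csv` module. The CSV layout and the variable names are assumptions made for this example.

```python
import csv
import io

# The student table from the text, written as CSV for illustration.
csv_text = """StudentID,Name,Age,Grade
001,John Doe,16,A
002,Jane Roe,17,B
003,Sam Smith,16,A
"""

# Each row becomes a dict keyed by column name; all values are read as strings.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Because the schema is fixed, each column can be converted to its known type.
for row in rows:
    row["Age"] = int(row["Age"])

# Comparable to the SQL query: SELECT Name FROM students WHERE Grade = 'A'
grade_a_names = [row["Name"] for row in rows if row["Grade"] == "A"]
average_age = sum(row["Age"] for row in rows) / len(rows)

print(grade_a_names)
print(average_age)
```

`StudentID` is deliberately left as a string so that the leading zeros are preserved; it is an identifier, not a quantity.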
18.2.2 Unstructured Data:
Unstructured data refers to data that does not have a predefined format or organization. This type of data is often text-heavy, but it can also include multimedia content such as images, audio, and video. Unstructured data is more challenging to store, search, and analyze because it does not fit neatly into traditional databases.
Characteristics of Unstructured Data:
- Format: Does not follow a specific format or structure.
- Data Types: Diverse data types (e.g., text, images, audio, video).
- Schema: No fixed schema; data can vary widely.
- Accessibility: More complex to search and analyze, often requiring advanced tools and techniques like natural language processing (NLP) and machine learning.
Examples:
- Text documents (e.g., Word files, PDFs)
- Emails
- Social media posts
- Multimedia files (e.g., images, videos, audio files)
- Sensor data
Example of Unstructured Data:
- A collection of customer feedback emails
- Social media posts containing images and text
- Audio recordings of customer service calls
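Analyzing unstructured text usually starts by imposing some structure on it. The sketch below shows one very simple approach, counting word frequencies, using only the standard library. The feedback snippets are hypothetical, the `tokenize` helper is introduced for illustration, and real NLP pipelines involve considerably more than this.

```python
import re
from collections import Counter

# Hypothetical feedback snippets; real emails would be longer and messier.
feedback = [
    "Delivery was late and the package was damaged.",
    "Great product, fast delivery!",
    "The product stopped working after a week.",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


# Turn free text into a structured word -> count table.
word_counts = Counter(word for message in feedback for word in tokenize(message))

print(word_counts.most_common(5))
```

Once the text has been reduced to counts, the descriptive techniques from Section 18.1 (frequencies, proportions, bar charts) become applicable.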
18.2.3 Semi-Structured Data:
While not as rigidly structured as traditional databases, semi-structured data contains tags or markers to separate data elements. This type of data does not conform to a rigid schema but does have some organizational properties that make it easier to analyze than unstructured data.
Characteristics of Semi-Structured Data:
- Format: Contains tags or markers.
- Data Types: Mix of structured and unstructured data elements.
- Schema: Flexible schema; can change over time.
- Accessibility: Easier to analyze than unstructured data but more complex than structured data.
Examples:
- JSON files
- XML files
- HTML files
- NoSQL databases (e.g., MongoDB)
Example of Semi-Structured Data:
A JSON object representing a book:
{
"title": "To Kill a Mockingbird",
"author": "Harper Lee",
"publication_year": 1960,
"genres": ["Fiction", "Drama"]
}
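The JSON object above can be read with Python's standard `json` module. This is a minimal sketch; the `isbn` lookup is a hypothetical field added only to show how code can cope with a flexible schema in which some keys may be absent.

```python
import json

# The book record from the text, as a JSON string.
book_json = """
{
  "title": "To Kill a Mockingbird",
  "author": "Harper Lee",
  "publication_year": 1960,
  "genres": ["Fiction", "Drama"]
}
"""

# Parsing yields nested Python objects: a dict containing strings, an int, and a list.
book = json.loads(book_json)

print(book["title"])
print(book["genres"])

# Flexible schema: a key may be missing, so supply a default instead of assuming it exists.
isbn = book.get("isbn", "unknown")
print(isbn)
```

The keys act as the tags or markers that give the data its partial structure, while the nested list under `genres` shows why such records do not map directly onto a single flat table.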
Summary:
- Structured Data: Highly organized, easily searchable (e.g., databases, spreadsheets).
- Unstructured Data: Lacks organization, diverse formats (e.g., text files, multimedia).
- Semi-Structured Data: Contains some organizational properties (e.g., JSON, XML).
Understanding these categories helps in choosing the right tools and techniques for data storage, processing, and analysis.
18.3 Data types based on measurement
At the highest level, statistics distinguishes two classes of data: quantitative data and qualitative data. The distinction reflects whether a value is a measured quantity or an observed attribute of interest.
Qualitative data are also referred to as categorical data. They describe observed attributes that cannot be measured numerically. Examples: race, age group, gender, origin, and so on. Even when the categories are coded as numbers (for example, 1 for male and 0 for female), the numbers are only labels and carry no arithmetic meaning.
Quantitative data, on the other hand, describe quantities that can be counted or measured, so they are expressed as numbers. They are also known as numerical data and lend themselves to statistical analysis. Examples: height, volume of water, distance, and so on.
We can further subdivide quantitative data and qualitative data into 4 subtypes as follows:
- Qualitative (Categorical) data types
  - Nominal data, Ordinal data
- Quantitative (Numerical) Data Types
  - Interval data, Ratio data
18.3.1 Qualitative (Categorical) data types
Qualitative data can be subdivided into nominal and ordinal data types. While both these types of data can be classified, ordinal data can be ordered as well.
i) Nominal Data
Nominal data represent discrete categories that have no inherent order and cannot be measured. They are used to label variables without providing any quantitative value, and they have no meaningful zero.
Some examples of nominal data include:
- Gender (Male, Female)
- Hair colour (Black, Brown, Gray, etc.)
- Nationality (Indian, American, Chinese, etc.)
The only logical operations that you can apply to nominal data are tests of equality and inequality, which you can also use to group the values.
The descriptive statistics you can compute for nominal data include frequencies, proportions, percentages, and the mode. To visualize nominal data, you can use a pie chart or a bar chart.
Data analysts use encoding to transform nominal data into numeric features by assigning numerical values to the categories.
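The statistics and the encoding step described above can be sketched in plain Python. The sample of hair colours is hypothetical. The text does not name a specific encoding scheme, so one-hot encoding is used here as an assumption; it is a common choice for nominal data because it does not impose an artificial order on the categories.

```python
from collections import Counter

# Hypothetical sample of a nominal variable.
hair_colours = ["Black", "Brown", "Black", "Gray", "Brown", "Black"]

# Frequencies, proportions, and the mode are the summaries that make sense here.
counts = Counter(hair_colours)
total = len(hair_colours)
proportions = {category: count / total for category, count in counts.items()}
mode = counts.most_common(1)[0][0]

# One-hot encoding: one 0/1 column per category.
# Sorting only fixes the column order; it does not imply a ranking.
categories = sorted(counts)
one_hot = [
    [1 if value == category else 0 for category in categories]
    for value in hair_colours
]

print(counts)
print(mode)
print(categories)
print(one_hot)
```

Each encoded row contains exactly one 1, marking the category that observation belongs to.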
ii) Ordinal Data
Ordinal data represent discrete, ordered categories. Unlike nominal data, the order matters. However, the distances between adjacent categories are not necessarily equal. Like nominal data, ordinal data have no meaningful zero.
Examples of ordinal data
- Opinion (agree, mostly agree, neutral, mostly disagree, disagree)
- Socioeconomic status (low income, middle income, high income)
Data scientists use label encoding to transform ordinal data into a numeric feature.
The descriptive statistics that you can compute for ordinal data include frequencies, proportions, percentages, percentiles, the median, the mode, and the interquartile range. The visualization methods that can be used are the same as for nominal data.
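Label encoding for ordinal data can be sketched as a mapping from each category to its rank. The survey responses below are hypothetical. `statistics.median_low` is used so that the median is always one of the observed ranks, which can then be mapped back to a category.

```python
import statistics

# Categories listed from lowest to highest agreement; the order is what carries meaning.
levels = ["disagree", "mostly disagree", "neutral", "mostly agree", "agree"]
rank = {level: position for position, level in enumerate(levels)}

# Hypothetical survey responses.
responses = ["agree", "neutral", "mostly agree", "disagree", "mostly agree"]

# Label encoding: replace each category with its rank.
encoded = [rank[response] for response in responses]

# The median is meaningful for ordinal data; map the median rank back to its label.
median_level = levels[statistics.median_low(encoded)]

print(encoded)
print(median_level)
```

The encoded values support ordering and medians, but not averaging: the gap between ranks 0 and 1 is not guaranteed to match the gap between ranks 3 and 4.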
18.3.2 Quantitative (Numerical) Data Types
Two types of quantitative data are discrete data and continuous data.
Discrete data have distinct, separate values. They can take only specific values, with nothing in between, so all counted data are discrete. Some examples of discrete data include shoe sizes, number of students in a class, number of languages an individual speaks, etc.
Continuous data, on the other hand, can take any of an infinite number of possible values within a given range. They can be divided into ever finer parts and are measured rather than counted. Examples of continuous data include temperature, height, weight, etc.
- Continuous data can be visualized by histogram or box plot while bar graphs or stem plots can be used for discrete data.
Quantitative data are measured on one of two scales: interval or ratio.
iii) Interval Data
It represents ordered data that is measured along a numerical scale with equal distances between the adjacent units. These equal distances are also referred to as intervals. So a variable contains interval data if it has ordered numeric values with the exact differences known between them.
Interval data can be continuous or discrete.
Examples of Interval data
- IQ test’s intelligence scale
- Time if measured using a 12-hour clock
- Temperature in degrees Celsius
You can compare interval values and add or subtract them, but you cannot meaningfully multiply or divide them because the scale has no meaningful zero. The descriptive statistics you can apply to interval data include measures of central tendency, range, and spread.
iv) Ratio Data
Like interval data, ratio data are ordered, with equal differences between adjacent units. However, they also have a meaningful zero, so they cannot take negative values.
Examples of ratio data
- Height (zero is the starting point)
Because the zero point is real, the values can also be multiplied and divided, and they can be sorted as well. The descriptive statistics you can apply to ratio data are the same as for interval data and include measures of central tendency, range, and spread.
Overall, ratio data and interval data are the same with equal spacing between adjoining values but the former also has a meaningful zero.
Besides addition and subtraction, you can also multiply and divide the data, which is impossible with interval data as it does not have an absolute zero.
However, interval data can take negative values with no absolute zero while ratio data cannot.
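A short worked example makes the difference concrete. Celsius is an interval scale, while Kelvin measures the same quantity on a ratio scale with a true zero. The specific temperatures and heights below are hypothetical values chosen for illustration.

```python
import math


def celsius_to_kelvin(celsius):
    """Convert a temperature from the Celsius scale to the Kelvin scale."""
    return celsius + 273.15


t1_c, t2_c = 10.0, 20.0

# Differences are meaningful on an interval scale: both scales agree on a 10-degree gap.
difference_c = t2_c - t1_c
difference_k = celsius_to_kelvin(t2_c) - celsius_to_kelvin(t1_c)

# Ratios are not: 20 degrees C is not "twice as hot" as 10 degrees C.
ratio_c = t2_c / t1_c  # 2.0, an artefact of where the Celsius zero happens to sit
ratio_k = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)  # about 1.035

# Height is ratio data, so the ratio survives a change of unit.
ratio_cm = 180.0 / 90.0
ratio_inches = (180.0 / 2.54) / (90.0 / 2.54)

print(difference_c, round(difference_k, 2))
print(ratio_c, round(ratio_k, 3))
print(ratio_cm, round(ratio_inches, 3))
```

The Celsius ratio changes when the zero point moves, whereas the height ratio is the same in centimetres and inches. That invariance is what a meaningful zero provides.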
18.3.3 Detailed Comparison of Levels of Measurement
The table below presents a detailed comparison of the four main levels of measurement—Nominal, Ordinal, Interval, and Ratio—across various attributes, ensuring a comprehensive understanding of their distinct characteristics.
Attribute | Nominal | Ordinal | Interval | Ratio |
---|---|---|---|---|
Description | Labels used only to identify an element. | Have the properties of nominal data plus a rank or order. | Have the properties of ordinal data plus equal, known intervals between values, so data can be ranked and the difference between any two values calculated. | Have the properties of interval data plus a true zero point that indicates the absence of the quantity being measured, which allows ratios to be calculated and makes ratio scales the most informative level of measurement. |
Examples | Sex, education, marital status, employee id, etc. | Excellent, good, poor, etc. | Exam scores, temperature, etc. | Height, weight, wages, production, etc. |
Numeric/Non-numeric | Can be both numeric and non-numeric | Can be both numeric and non-numeric | Always Numeric | Always Numeric |
Arithmetical Operations | Make no sense | Make no sense | Are meaningful (+, -) | Are meaningful (+, -, *, /) |
Zero Point | Has a value but does not indicate absence. | Any zero is only a rank label and does not indicate absence. | Has a value but does not indicate a true absence of the quantity. | Zero value indicates that nothing exists for the variable at this point, allowing for a full range of calculations. |
Scales of Measurement